Reading, Scraping, and Cleaning Our Data

This notebook shows our process of data munging in more detail.

Packages

In [1]:
# Loading in data:
import numpy as np
import pandas as pd

# Parsing:
import requests
import requests_cache
import lxml
from bs4 import BeautifulSoup
import bs4
import re

I. Reading in the Data - Our Primary Data Source

The following cell loads our data through read_csv and takes a long time to run, about 15 minutes on our devices. After importing the csv once here, we export street.h5 so the data can be read in much faster on subsequent runs.

In [ ]:
# Should only run this cell the first time running this notebook:

# Convert these to datetimes
parseDates = ["Opened", "Closed", "Updated"] 
street_csv = pd.read_csv("Street_and_Sidewalk_Cleaning.csv", 
                         #nrows = 100000,
                         parse_dates=parseDates)

street = street_csv
# Export h5 for faster reading in later
street.to_hdf('street.h5','table',append=False) 

The next cell loads the h5 and displays the first few rows of our primary data source. All of the variables can be seen here.

In [2]:
# Read h5: run this on every run of this notebook after the first
street_hdf = pd.HDFStore('street.h5')
street = street_hdf.select('/table')
street.head()
Out[2]:
CaseID Opened Closed Status Responsible Agency Address Category Request Type Request Details Source Supervisor District Neighborhood Updated Point
0 322566 2008-11-30 22:56:00 2009-07-21 16:24:00 Closed DPW Ops Queue 1566 HYDE ST, SAN FRANCISCO, CA, 94109 Street and Sidewalk Cleaning Sidewalk_Cleaning Garbage Voice In 3.0 Nob Hill 2009-07-21 16:24:00 (37.795328529, -122.418067787)
1 322560 2008-11-30 22:24:00 2009-07-21 16:24:00 Closed DPW Ops Queue Intersection of TAYLOR ST and TURK ST Street and Sidewalk Cleaning Illegal_Dumping Garbage Voice In 6.0 Downtown/Civic Center 2009-07-21 16:24:00 (37.7832142923006, -122.410764953722)
2 322559 2008-11-30 22:20:00 2008-12-01 06:07:00 Closed DPW Ops Queue 1601 SACRAMENTO ST, SAN FRANCISCO, CA, 94109 Street and Sidewalk Cleaning Illegal_Dumping Mattress Voice In 3.0 Nob Hill 2008-12-01 06:07:00 (37.791643509125, -122.419260423291)
3 322556 2008-11-30 21:34:00 2008-12-09 08:07:00 Closed DPW Ops Queue Intersection of 25TH ST and LUCKY ST Street and Sidewalk Cleaning Sidewalk_Cleaning Human_waste_or_urine Voice In 9.0 Mission 2008-12-09 08:07:00 (37.7509362479361, -122.413364630104)
4 322545 2008-11-30 20:45:00 2009-02-03 07:09:00 Closed DPW Ops Queue Intersection of 19TH ST and OAKWOOD ST Street and Sidewalk Cleaning Sidewalk_Cleaning Empty_cart Voice In 8.0 Mission 2009-02-03 07:09:00 (37.7598935230725, -122.42486308879)
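
Equivalently, pandas can load the same table back in a single call; a minimal sketch assuming the same file and key as above:

In [ ]:
# One-call equivalent of the HDFStore/select pair above
street = pd.read_hdf('street.h5', 'table')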

Cleaning the data

To keep counts of requests consistent over the months and days, we remove years with incomplete data since our original data set starts in November 2008 and ends in January 2017.

In [3]:
street = street.loc[street['Opened'].dt.year != 2008]
street = street.loc[street['Opened'].dt.year != 2017]
street = street.sort_values("Opened") # Re-sort data in date order
street = street.reset_index(drop=True) # Re-set the indexing, dropping the old index column

Some neighborhood names need to be changed to allow merging with other data sources later and to match our shape files for geographic plots.

In [4]:
street.loc[street.Neighborhood == "None", "Neighborhood"] = np.nan
street.loc[street.Neighborhood == "Ocean View", "Neighborhood"] = "Oceanview"
street.loc[street.Neighborhood == "Downtown/Civic Center", "Neighborhood"] = "Civic Center"
street.loc[street.Neighborhood == "St. Mary's Park", "Neighborhood"] = "St. Marys Park"
street.loc[street.Neighborhood == "Presidio", "Neighborhood"] = "Presidio Heights"

Next we create a month column for future plots and data merging.

In [5]:
street['month'] = street.Opened.dt.month # Vectorized month extraction

A look at the first few rows of our cleaned data:

In [6]:
street.head()
Out[6]:
CaseID Opened Closed Status Responsible Agency Address Category Request Type Request Details Source Supervisor District Neighborhood Updated Point month
0 342509 2009-01-01 08:30:51 2009-01-01 11:07:06 Closed DPW Ops Queue Intersection of 13TH ST and FOLSOM ST Street and Sidewalk Cleaning Sidewalk_Cleaning Encampment Voice In 6.0 Mission 2009-01-01 11:07:06 (37.7695911772607, -122.415577110949) 1
1 342510 2009-01-01 08:33:46 2009-01-01 11:07:06 Closed DPW Ops Queue Intersection of 13TH ST and FOLSOM ST Street and Sidewalk Cleaning Sidewalk_Cleaning Debris_filled_carts Voice In 6.0 Mission 2009-01-01 11:07:06 (37.7695911772607, -122.415577110949) 1
2 342512 2009-01-01 08:44:54 2009-01-31 13:09:53 Closed DPW Ops Queue 467 FILLMORE ST, SAN FRANCISCO, CA, 94117 Street and Sidewalk Cleaning Street_Cleaning Glass Voice In 5.0 Western Addition 2009-01-31 13:09:53 (37.773807246, -122.431027495) 1
3 342514 2009-01-01 09:13:07 2009-01-01 11:07:06 Closed DPW Ops Queue Intersection of DWIGHT ST and GOETTINGEN ST Street and Sidewalk Cleaning Sidewalk_Cleaning Garbage Voice In 9.0 Excelsior 2009-01-01 11:07:06 (37.7232896018615, -122.405086927628) 1
4 342519 2009-01-01 09:21:05 2009-01-21 06:07:13 Closed DPW Ops Queue 1610 MCALLISTER ST, SAN FRANCISCO, CA, 94115 Street and Sidewalk Cleaning Sidewalk_Cleaning Human_waste_or_urine Voice In 5.0 Western Addition 2009-01-21 06:07:13 (37.777956377, -122.43893262) 1

We then exported the cleaned data set for use in the analysis notebook.

In [7]:
# Export h5 for faster reading in later
# Warning: the resulting file is about 600 MB
street.to_hdf('streetCleaned.h5','table',append=False)
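
If the file size is a concern, to_hdf also accepts compression options; a hedged sketch (the complevel and complib values here are just one reasonable choice):

In [ ]:
# Optional: write a compressed version of the same table (smaller file, slightly slower I/O)
street.to_hdf('streetCleaned.h5', 'table', append=False, complevel=9, complib='blosc')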

II. Scraping - Demographic Data

Here we scrape demographic data from the source described on our website.

In [8]:
requests_cache.install_cache('sf_cache') # So we don't make too many requests to the website

Grab the html from the website:

In [11]:
url = "http://www.city-data.com/nbmaps/neigh-San-Francisco-California.html"
response = requests.get(url)
response.raise_for_status()
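
With the cache installed, requests_cache patches responses so repeated fetches are served locally; on a second run this can be confirmed directly:

In [ ]:
# True if this response came from sf_cache rather than the network
response.from_cache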

We first parse using Beautiful Soup and lxml. Then we grab the div containing the information we want for each neighborhood and store the result as a list in neighborhood_divs.

In [12]:
neighborhoods_bs = BeautifulSoup(response.text, 'lxml')
neighborhood_names = neighborhoods_bs.find_all(name = "span", attrs={'class':'street-name'})
neighborhood_names = [name.text.replace("-", " ") for name in neighborhood_names]
neighborhood_divs = neighborhoods_bs.body.find_all(name = "div", attrs={'class':'neighborhood'})

Parsing the html for each neighborhood turned out to be less simple than we expected. The format the website uses involves a lot of navigable strings: section titles like "Area" and "Population" are not nested nicely, and the value for the area is stored as a navigable string that is a sibling of the title rather than a child, which complicated the parsing. The function catch() handles three cases, depending on how far the value sits from its title. It is called catch because, whenever a case fails, the value should be NA, so the function relies on try and except. It also converts the strings to numbers, removing any commas and dollar signs that get in the way.
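
To make the sibling issue concrete, here is a toy fragment mimicking the site's markup (not the real page, just an illustration of Case 1 below):

In [ ]:
# Hypothetical markup: the value is a navigable string sibling of the <b> title
toy = BeautifulSoup('<div class="neighborhood"><b>Area:</b>0.144<span> square miles</span></div>', 'lxml')
line = toy.find_all(name = "b", string = "Area:")
float(line[0].next_sibling) # -> 0.144, exactly what Case 1 of catch() does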

In [13]:
def catch(line, number, simple, c = "h", complicated = False):
    """
    Catch the cases when scraping where we want NAs: if grabbing the value from
    line fails, NA is returned. If number is True, the value is converted from
    a string to a number. If simple is True, the value is the immediate
    sibling; otherwise we search for the class given by c. complicated handles
    the last column we scrape, where finding that class takes an extra step.
    """
    try:
        # Value should be converted to a number
        if number:
            # Case 1: value is the next sibling
            if simple:
                return float(line[0].next_sibling.replace(",", "").replace("$", ""))
            # Case 2: the website shows a horizontal bar plot with two bars;
            # c = "h" is the first bar and c = "a" is the second
            elif not complicated:
                return float(line[0].next_sibling.next_sibling.find_all(class_ = c)[0]
                             .next_sibling.replace(",", "").replace(" years", "").replace("$", ""))
            # Case 3: bar plot, but with an extra line before the plot
            else:
                return float(line[0].find_all_next(class_ = c)[0]
                             .next_sibling.replace(",", "").replace(" years", "").replace("$", ""))
        else:
            return line[0].next_sibling
    except:
        return np.nan # If the case failed, there is no value and we want NA
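
The failure path is easy to exercise directly: an empty find_all result makes the indexing fail, and the except clause returns NaN:

In [ ]:
# line[0] on an empty list raises IndexError inside the try, so we get NaN back
catch([], number = True, simple = True)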

We then created a demographic data frame with all the values we want to scrape stored as columns. For each column, catch() is passed the find_all result for the title being searched for, along with flags specifying which of the three cases applies. The first few rows of the data frame are displayed below.

In [14]:
demographic = pd.DataFrame({
        'Neighborhood': neighborhood_names, 
        'AreaSqMi': [catch(neigh.find_all(name = "b", string = "Area:"), True, True) for neigh in neighborhood_divs] ,
        'Population': [catch(neigh.find_all(name = "b", string = "Population:"), True, True) for neigh in neighborhood_divs], 
        'PeoplePerSqMi': [catch(neigh.find_all(name = "b", string = "Population density:"), True, False) for neigh in neighborhood_divs],
        'MedHouseholdIncome': [catch(neigh.find_all(name = "b", string = "Median household income in 2015: "), True, False) for neigh in neighborhood_divs],
        'MedRent': [catch(neigh.find_all(name = "b", string = "Median rent in in 2015: "), True, False) for neigh in neighborhood_divs],
        'Males': [catch(neigh.find_all(name = "b", string = "Male vs Females"), True, False) for neigh in neighborhood_divs],
        'Females': [catch(neigh.find_all(name = "b", string = "Male vs Females"), True, False, "a") for neigh in neighborhood_divs], 
        'MedAgeM': [catch(neigh.find_all(name = "b", string = "Median age"), True, False) for neigh in neighborhood_divs],
        'MedAgeF': [catch(neigh.find_all(name = "b", string = "Median age"), True, False, "a") for neigh in neighborhood_divs],
        'HousePrice': [catch(neigh.find_all(name = "b", string = "Average estimated value of detached houses in 2015 "), True, False, complicated = True) for neigh in neighborhood_divs],
})

demographic.head()
Out[14]:
AreaSqMi Females HousePrice Males MedAgeF MedAgeM MedHouseholdIncome MedRent Neighborhood PeoplePerSqMi Population
0 0.144 2461.0 1988926.0 3916.0 38.6 35.4 93901.0 1754.0 Alamo Square 44418.0 6379.0
1 0.124 944.0 455053.0 1177.0 38.9 41.2 104697.0 1937.0 Anza Vista 17080.0 2122.0
2 0.037 96.0 1101672.0 145.0 48.2 48.4 139017.0 2014.0 Aquatic Park 6487.0 241.0
3 0.055 774.0 NaN 1365.0 35.2 35.1 134523.0 2489.0 Baja Noe 38816.0 2141.0
4 0.048 312.0 NaN 377.0 42.6 38.6 84685.0 1591.0 Balboa Park 14350.0 690.0

So now we have lots of demographic data on gender, age, area, rent, income, population, and so on, all from 2015. Next, the neighborhood names in this data need to be cleaned to match up with our primary data set. The CombineNeighs() function takes two rows of the demographic data and combines them, averaging some columns (like median age) and summing others (like population). This is necessary in the few cases seen in the next cell; one example is that the demographic data had "Bayview District" and "Bayview Heights" while our primary street data only had "Bayview."

In [15]:
def CombineNeighs(name1, name2, newName, df):
    """
    This function takes in two neighborhood names of the demographic dataframe, combines them, 
    and changes the name.
    """
    # Grab the two rows we want:
    one = df.loc[df.Neighborhood == name1].copy() # Copy to avoid SettingWithCopyWarning below
    two = df.loc[df.Neighborhood == name2]
    # Sum or average all the columns:
    one.loc[one.index, "AreaSqMi"] = (one['AreaSqMi'].values[0] + two['AreaSqMi'].values[0]) / 2
    one.loc[one.index, 'Females'] = one['Females'].values[0] + two['Females'].values[0]
    one.loc[one.index, 'HousePrice'] = (one['HousePrice'].values[0] + two['HousePrice'].values[0]) / 2
    one.loc[one.index, 'Males'] = one['Males'].values[0] + two['Males'].values[0]
    one.loc[one.index, 'MedAgeF'] = (one['MedAgeF'].values[0] + two['MedAgeF'].values[0]) / 2
    one.loc[one.index, 'MedAgeM'] = (one['MedAgeM'].values[0] + two['MedAgeM'].values[0]) / 2
    one.loc[one.index, 'MedHouseholdIncome'] = (one['MedHouseholdIncome'].values[0] + two['MedHouseholdIncome'].values[0]) / 2
    one.loc[one.index, 'PeoplePerSqMi'] = (one['PeoplePerSqMi'].values[0] + two['PeoplePerSqMi'].values[0]) / 2
    one.loc[one.index, 'Population'] = one['Population'].values[0] + two['Population'].values[0]   
    one.loc[one.index, 'Neighborhood'] = newName
    # Drop the old rows:
    df = df.drop(one.index)
    df = df.drop(two.index)
    # Add the new row:
    df = pd.concat([df, one]) # df.append is deprecated in newer pandas
    return df
In [17]:
# Clean up neighborhood names: make the demographic names match our data and shapefiles
demographic = CombineNeighs("Bayview District", "Bayview Heights", "Bayview", demographic)
demographic = CombineNeighs("Bernal Heights North", "Bernal Heights Park", "Bernal Heights1", demographic)
demographic = CombineNeighs("Bernal Heights South", "Bernal Heights", "Bernal Heights2", demographic)
demographic = CombineNeighs("Bernal Heights1", "Bernal Heights2", "Bernal Heights", demographic)
demographic = CombineNeighs("Balboa Park", "Balboa Terrace", "Balboa Terrace", demographic)
demographic = CombineNeighs("Aquatic Park", "Fort Mason", 'Aquatic Park / Ft. Mason', demographic)
demographic = CombineNeighs("Downtown", "Union Square", 'Downtown / Union Square', demographic)
demographic = CombineNeighs("Financial District", "Financial District South", "Financial District", demographic)

The next cleaning step handles cases where a name simply needs to be changed.

In [18]:
demographic.loc[demographic.Neighborhood == "Buena Vista Park", "Neighborhood"] = "Buena Vista"
demographic.loc[demographic.Neighborhood == "Cayuga Terrace", "Neighborhood"] = "Cayuga"
demographic.loc[demographic.Neighborhood == "Ingleside Terrace", "Neighborhood"] = "Ingleside Terraces"
demographic.loc[demographic.Neighborhood == "Laurel Heights", "Neighborhood"] = 'Laurel Heights / Jordan Park'
demographic.loc[demographic.Neighborhood == "Lake Shore", "Neighborhood"] = 'Lakeshore'
demographic.loc[demographic.Neighborhood == "South Of Market", "Neighborhood"] = 'South of Market'
demographic.loc[demographic.Neighborhood == "North Waterfront", "Neighborhood"] = 'Northern Waterfront'
demographic.loc[demographic.Neighborhood == "St Marys Square", "Neighborhood"] = "St. Marys Park"
demographic.loc[demographic.Neighborhood == "Saint Francis Wood", "Neighborhood"] = "St. Francis Wood"
demographic.loc[demographic.Neighborhood == "Sea Cliff", "Neighborhood"] = "Seacliff"
demographic.loc[demographic.Neighborhood == "Parnassus", "Neighborhood"] = "Parnassus Heights"
demographic.loc[demographic.Neighborhood == "Park Merced", "Neighborhood"] = "Parkmerced"
demographic.loc[demographic.Neighborhood == "Mission District", "Neighborhood"] = "Mission"
demographic.loc[demographic.Neighborhood == "Marina District", "Neighborhood"] = "Marina"
demographic.loc[demographic.Neighborhood == "Mount Davidson Manor", "Neighborhood"] = "Mt. Davidson Manor"
demographic.loc[demographic.Neighborhood == "Lake", "Neighborhood"] = "Lake Street"

Finally, the demographic information was exported for use in the analysis notebook.

In [19]:
# Export demographic here, then import in the other notebook and merge before plotting.  
demographic.to_csv("demographic.csv")

III. Scraping - Events and Festivals

San Francisco Pride

The dates and estimated attendance of the SF Pride parade were located in a table on the Wikipedia page.

In [20]:
url_pride = "https://en.wikipedia.org/wiki/San_Francisco_Pride"
response = requests.get(url_pride)
response.raise_for_status()

pride_bs = BeautifulSoup(response.text, 'lxml')

# Locate the right table
pride_table = pride_bs.find_all(name = "table", attrs={'class':'wikitable'})
In [21]:
#type(pride_table[0])
# The table is the first element

rows = pride_table[0].find_all(name = "tr")

# The first row is the header
# print(rows[0])
# for colname in rows[0].find_all(name = "th"):
#     print(colname.text)

# Determine the order of the columns
colnames = [colname.text for colname in rows[0].find_all(name = "th")]
print(colnames)
[u'Year', u'Dates', u'Festival name', u'Theme', u'Estimated attendance', u'Notes']
In [22]:
# Extract the dates and attendance data from the appropriate columns
years = []
dates = []
attendance = []

for row in rows[1:]:
    cells = row.find_all(name = "td")
        
    years.append(cells[0].text)
    dates.append(cells[1].text)
    attendance.append(cells[4].text)
    
assert(len(years) == len(dates) == len(attendance))

We could then create a data frame from the extracted data.

In [23]:
pride = pd.DataFrame.from_dict({"year": years,
                                "date": dates,
                                "attendance": attendance})

pride.year = pd.to_numeric(pride.year)

# Remove unused years
pride = pride[pride.year > 2007]

The dates and attendance estimates required parsing to get compatible types for later use.

In [24]:
# Extract the start and end dates of the parade
# Dates look like "June 28–29": dropping the last three characters gives the start;
# the month ("June", always four characters) plus the last two characters gives the end
startdates = [d[:-3] for d in pride.date]
enddates = [d[:4] + d[-2:] for d in pride.date]
startdatetimes = []
enddatetimes = []

for sdate, edate, year in zip(startdates, enddates, pride.year):
    startdatetimes.append(sdate + " " + str(year))
    enddatetimes.append(edate + " " + str(year))
    
pride["start_date"] = pd.to_datetime(startdatetimes)
pride["end_date"] = pd.to_datetime(enddatetimes)
#pride["datetimes"] = [pd.date_range(dt, periods=2) for dt in startdatetimes]

This function was used to parse the attendance strings and convert them to numeric data.

In [25]:
def parse_attendance(attendance_str):
    """
    Parse the attendance strings in the table in the Wikipedia page for the Pride parade. 
    Returns a float or None if the string cannot be parsed.
    """
    
    if "million" in attendance_str.lower():
        # Multiply by 1000000 and remove the string "million"
        return float(attendance_str.split(" ")[0]) * 1000000
    
    else:
        return None

        
pride["attendance_num"] = [parse_attendance(x) for x in pride.attendance]
pride = pride.reset_index()
pride
Out[25]:
index attendance date year start_date end_date attendance_num
0 38 1.2 million[25] June 28–29 2008 2008-06-28 2008-06-29 1200000.0
1 39 1.2 million[26] June 27–28 2009 2009-06-27 2009-06-28 1200000.0
2 40 1.2 million[27] June 26–27 2010 2010-06-26 2010-06-27 1200000.0
3 41 1 million[28] June 25–26 2011 2011-06-25 2011-06-26 1000000.0
4 42 June 23–24 2012 2012-06-23 2012-06-24 NaN
5 43 1.5 Million June 29–30 2013 2013-06-29 2013-06-30 1500000.0
6 44 1.7 million June 28–29 2014 2014-06-28 2014-06-29 1700000.0
7 45 1.8 million June 27–28 2015 2015-06-27 2015-06-28 1800000.0
8 46 TBD June 25-26 2016 2016-06-25 2016-06-26 NaN

Outside Lands Music and Arts Festival

The process of extracting the dates from the Wikipedia page for Outside Lands was more involved, because the data was less structured. Some of the dates were stored in heading tags, while others were in paragraphs.

In [26]:
url_ol = "https://en.wikipedia.org/wiki/Outside_Lands_Music_and_Arts_Festival"
response = requests.get(url_ol)
response.raise_for_status()

ol_bs = BeautifulSoup(response.text, 'lxml')


# Years are in h3
h3 = ol_bs.find_all(name = "h3")

ol = []

for h in h3:
    span = h.find_all(name = "span", attrs={"class":"mw-headline"})
    
    # If there is a span in the h3
    if span:
        year = span[0].text 
        
        check_p = True
        
        for sibling in h.find_next_siblings(limit=5):
            
            # Days are in h4 or p
            if sibling.name == "h4":
                #print "h4" + "\t" + sibling.text
                ol.append([year, sibling.text.replace(u"\u2013", "-").replace("[edit]","")])
                # If an h4 was found, stop looking for p
                check_p = False
                
            elif sibling.name == "p" and check_p:
                # print(check_p)
                # print("p" + "\t" + sibling.text)
                ol.append([year, sibling.text.replace(u"\u2013", "-")])
                # Formatting for 2011+ uses the date in a p tag
                if "August" in sibling.text:
                    break

The extracted date ranges then had to be separated into individual dates.

In [27]:
# Separate the date ranges and fix the formatting

ol2 = []
for year_and_date in ol:
    
    date_split = year_and_date[1].split()
    month = date_split[0]
    
    if len(date_split) > 2:
        # For format 'Friday August 22'
        # Keep the year
        year_and_date_new = [year_and_date[0]]
        # Discard the weekday
        year_and_date_new.extend(date_split[1:])
        ol2.append(year_and_date_new)
    else:
        # for format 'August 12-14'
        days = date_split[1].split("-")
        for day in days:
            # keep the year
            year_and_date_new = [year_and_date[0]]
            year_and_date_new.append(month)
            year_and_date_new.append(day)
            ol2.append(year_and_date_new)

The dates could then be converted to a datetime for use with pandas.

In [28]:
ol_dates = pd.to_datetime([" ".join(date) for date in ol2])
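
Since Outside Lands always takes place in August, a lightweight check that the parsing worked:

In [ ]:
# All parsed festival dates should land in August
assert (ol_dates.month == 8).all()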

The dates were then exported for use in the analysis notebook.

In [29]:
ol_dates_df = pd.DataFrame({"Festival_Date": ol_dates})
ol_dates_df.to_csv("ol_dates.csv")
ol_dates_df.head()
Out[29]:
Festival_Date
0 2008-08-22
1 2008-08-23
2 2008-08-24
3 2009-08-28
4 2009-08-29

Further Preparation for Pride Data

For use in the analysis notebook, we calculated the number of requests in the surrounding neighborhoods on the days of the Pride parade.

In [31]:
# Get all requests from June
streetJune = street.loc[street["Opened"].dt.month == 6].copy() # Copy to avoid SettingWithCopyWarning
streetJune.loc[:, "DateOpened"] = streetJune["Opened"].dt.date

# Neighborhoods found on http://www.sfpride.org/parade/
prideNeighs = ["South of Market", "Tenderloin", "Financial District", "Downtown / Union Square", "Civic Center"]
# Get June requests in the right neighborhoods
streetJune = streetJune.loc[streetJune.Neighborhood.isin(prideNeighs)]

# Get the count of requests per day
JuneDayReqs = streetJune.groupby(by = "DateOpened").count().CaseID

# Get just the date 
pride.loc[:, "StartNoTime"] = pride["start_date"].dt.date 
pride.loc[:,"EndNoTime"] = pride["end_date"].dt.date 

# Get requests per day
JuneRequests = pd.DataFrame({"ReqCount": JuneDayReqs})
JuneRequests = JuneRequests.reset_index()

# Flag whether the parade started on that day
JuneRequests.loc[:,"Pride"] = [row in pride.StartNoTime.values for row in JuneRequests.DateOpened]

We then determined the number of requests on the days of the parade for each year by combining the dates of the parade with the requests on each day in June from above.

In [32]:
# Merge with start and end dates
pride_merged_start = JuneRequests.merge(right = pride, right_on="StartNoTime", left_on="DateOpened")
pride_merged_end = JuneRequests.merge(right = pride, right_on="EndNoTime", left_on="DateOpened")

# Remove extraneous columns
cols = ["DateOpened", "ReqCount", "attendance_num", "StartNoTime", "EndNoTime"]
pride_merged_start = pride_merged_start[cols]
pride_merged_end = pride_merged_end[cols]

# Combine requests on both days into one df
pride_merged = pd.concat([pride_merged_start, pride_merged_end])
pride_merged["Year"] = [t.year for t in pride_merged.StartNoTime]

# Get sum of requests over the two days of the parade each year
req_year = pride_merged.groupby(by="Year").sum(numeric_only=True) # Sum only the numeric columns
req_year.reset_index(inplace=True)


pride_merged_final = pride_merged_start

# Extract year
pride_merged_final["Year"] = [t.year for t in pride_merged_final.StartNoTime]

# Merge with requests per year 
pride_merged_final = pride_merged_final.merge(req_year, on = "Year")

# Keep only the columns we need: ReqCount_y is the two-day total from req_year,
# and attendance_num_x is the attendance estimate carried over from the start-date merge
pride_merged_final = pride_merged_final[["DateOpened", 
                                         "ReqCount_y", 
                                         "attendance_num_x", 
                                         "Year",
                                         "StartNoTime", 
                                         "EndNoTime"]]
pride_merged_final.head()
Out[32]:
DateOpened ReqCount_y attendance_num_x Year StartNoTime EndNoTime
0 2009-06-27 39 1200000.0 2009 2009-06-27 2009-06-28
1 2010-06-26 61 1200000.0 2010 2010-06-26 2010-06-27
2 2011-06-25 63 1000000.0 2011 2011-06-25 2011-06-26
3 2012-06-23 56 NaN 2012 2012-06-23 2012-06-24
4 2013-06-29 37 1500000.0 2013 2013-06-29 2013-06-30
In [33]:
pride_merged_final.to_csv("pride.csv")